function. For this purpose, one can perform a bioinformatic domain annotation, i.e. determine which binding domains and functional sites are present; these provide information about binding factors, but also about the regulation and function of proteins. Databases such as SMART, ProDom and Pfam provide information on proteins and domains and can also be used to search a protein sequence for known domains. Other important tools are the BLAST algorithm, the conserved domain search server (CDD) and the ELM server, which allow the analysis and prediction of domains in unknown sequences.
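To make this concrete, the following is a minimal sketch, not taken from the tools above, of how such a similarity search could be scripted: it assumes a hypothetical protein fragment and uses Biopython's interface to the NCBI BLAST service; a domain scan against Pfam or SMART would instead go through the respective web servers or HMMER.

```python
# Minimal sketch: remote BLASTP search for a (hypothetical) protein fragment
# via Biopython's NCBI interface. Domain scans against Pfam or SMART would
# instead use the respective web servers or HMMER's hmmscan.
from Bio.Blast import NCBIWWW, NCBIXML

# Placeholder query sequence (one-letter amino acid code); replace with a
# real protein sequence of interest.
protein_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQV"

# Submit the query to NCBI BLAST (blastp against the nr protein database)
result_handle = NCBIWWW.qblast("blastp", "nr", protein_seq)

# Parse the XML result; the best hits often point to already annotated
# homologues and thus to conserved domains and likely functions.
blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments[:5]:
    best_hsp = alignment.hsps[0]
    print(f"{alignment.title[:60]}  E-value: {best_hsp.expect:.2e}")
```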
Information on the metabolome (the totality of metabolites) can be obtained using mass spectrometry or gas chromatography. Metabolome profiling is of interest, for example, to see how metabolites change after a pathogenic infection or drug treatment, or how the metabolism of humans and that of a pathogen differ. This is important, for instance, if a potential pharmaceutical is to specifically affect the metabolism of a bacterium without producing a toxic effect in humans. Important databases on biochemical metabolism include the Roche Biochemical Pathways and KEGG. Software such as Metatool, YANA, YANAsquare or PLAS (Power Law Analysis and Simulation) is useful for investigating metabolism in more detail, e.g. which metabolic fluxes are present or what effect changes in metabolic pathways have.
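As a small illustration of the kind of calculation such tools automate, here is a minimal sketch under assumed toy data: for the stoichiometric matrix S of a made-up network with two internal metabolites and four reactions, all steady-state flux distributions v with S·v = 0 are obtained from the null space of S (elementary flux modes, as computed by Metatool or YANA, are particular minimal combinations of these).

```python
# Minimal sketch of steady-state flux analysis on an assumed toy network.
# Toy network: ->A (v1), A->B (v2), B-> (v3), A-> (v4); A and B are internal.
import numpy as np
from scipy.linalg import null_space

# Stoichiometric matrix S: rows = internal metabolites (A, B),
# columns = reactions (v1, v2, v3, v4)
S = np.array([
    [1, -1,  0, -1],   # metabolite A
    [0,  1, -1,  0],   # metabolite B
])

# At steady state S @ v = 0; the null space spans all admissible flux
# distributions (elementary flux modes are minimal, non-decomposable ones).
K = null_space(S)
print("Basis of the admissible flux space (columns):")
print(np.round(K, 3))

# Sanity check: each basis vector leaves the internal metabolite pools unchanged
assert np.allclose(S @ K, 0)
```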
The large amounts of data that we can generate with modern techniques obviously help to describe a biological system, such as the heart muscle, much better. On the other hand, it is clear that the crucial point is to understand the underlying principles, as just explained for main and side effects and as further illustrated by other central system building blocks in this chapter. One therefore has two possibilities for describing a complicated biological system:
First of all, knowledge-based research is used to elucidate the basic principles of the
biological system (for the myocardial cell in heart failure, see Figs. 5.1 and 5.2). Next, one
uses new data, preferably a great deal of it (nothing else is meant by “big data”), to substantiate or modify the insights and hypotheses gained.
As you can see, relying only on the sheer amount of data and on large data sets is more a sign of bias or inexperience. Without a clear hypothesis about the behavior of the system, it is much harder to read the right thing from the data, or better still, to verify it. Even worse, “hypothesis-free” research is mostly bad, even if its advocates claim that one is then unbiased towards the results, because it is very easy to fall prey to chance.
Let us illustrate this again with the gene expression dataset in heart failure. Assume that we have measured 20,000 mRNAs and now want to understand, without a clear hypothesis, which ones are increased in heart failure. Even if no objective differences exist between the two groups (say, with and without a drug), among 20,000 mRNAs we would purely by chance find about 1000 mRNAs that show a difference in expression between the two groups with a p-value <0.05. Bioinformaticians and statisticians, as well as experimenters experienced with large data sets, know this and therefore correct the statistics accordingly.
This is the correction for multiple testing, for example according to Bonferroni. In this correction for many comparisons, the significance threshold (e.g. 0.05) is divided by the number of tests (n). For the 20,000 mRNAs, one would therefore only accept differences with a p-value <0.0000025
(the Bonferroni-corrected threshold). This is a very strict correction, but it applies to any distribution of the data.
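To make these numbers tangible, here is a minimal simulation under purely assumed data: two groups of samples are drawn from the same distribution for 20,000 genes, so every “significant” gene at p < 0.05 arises by chance alone, and the Bonferroni threshold of 0.05/20,000 removes essentially all of them.

```python
# Minimal simulation: 20,000 mRNAs with NO true difference between two groups.
# Roughly 5 % (about 1000 genes) still come out "significant" at p < 0.05,
# which is exactly what the Bonferroni correction guards against.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_genes, n_samples = 20_000, 10            # 10 samples per group (assumed)

group_a = rng.normal(size=(n_genes, n_samples))   # e.g. "no drug"
group_b = rng.normal(size=(n_genes, n_samples))   # e.g. "drug" (same distribution!)

# Gene-wise t-tests across the two groups
_, p_values = ttest_ind(group_a, group_b, axis=1)

alpha = 0.05
bonferroni_alpha = alpha / n_genes          # 0.05 / 20,000 = 0.0000025

print("Nominally significant (p < 0.05):", np.sum(p_values < alpha))
print("Significant after Bonferroni    :", np.sum(p_values < bonferroni_alpha))
```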